ABSTRACT

Diagnostic testing confronts several challenges at 
once, among which are issues of test interpretation and immediate 
modification of the test itself in response to the interpretation. 
Several methods are available for administering and evaluating a test 
in real-time, towards optimizing the examiner's chances of isolating 
a persistent pattern of erroneous performance by a student. Under 
ideal circumstances, a student who misunderstands the test content 
would be identified early in a testing sequence; from this point the 
test could be tailored to estimates not only of ability but also (or 
instead) to the relative likelihoods of a set of competing diagnostic 
hypotheses that could account for the student's behavior (ability). 
Items which could discriminate among these hypotheses could be 
administered in increasingly well-bounded subsets until a specified 
stopping rule is met. The following models for this procedure are 
described and compared: Wald's sequential probability ratio test, 
Sixtl's modified binomial method, Choppin's catenating Bayesian 
method, Fink and Galen's decision path method, Shortliffe and 
Buchanan's inexact reasoning method, Kmietowicz and Pearson's ranked 
probability method, and Schum's cascaded inference method. (BW) 

Interpreting the Results of Diagnostic Testing: 
Some Statistics for Testing in Real Time 



by 

David McArthur and Chih-Ping Chou 

Introduction 

Diagnostic testing in education, as in a variety of other fields, 
confronts several challenges at once, among which are issues of test 
interpretation and immediate modification of the test itself in response to 
the interpretation. This paper explores a set of methods for administering 
and evaluating a test in real-time, towards optimizing the examiner's 
chances of isolating a persistent pattern of erroneous performance by a 
student. What is expected from these methods? What does each method take 
into account in the testing process? How do they compare with each other? 

For well over half a century the diagnostic value of interpreting a 
student's choice of a particular wrong answer to a test item has been 
appreciated (Pressey, 1926). Contemporary test specialists point to the 
measurement strength inherent in formulating tests for which the item 
distractors carry specific meanings for the appraisal of student abilities 
and disabilities (Roid & Haladyna, 1982). The rapid development of computer 
technology in the last decade has almost eliminated the practical 
restrictions on such testing. However, the overwhelming predilection 
continues in favor of correct/incorrect response scoring. The probative value 
of a wrong response -- that is, its significance for or against one or 
another of a set of plausible diagnostic hypotheses -- is totally obviated 
by conventional 0/1 scoring algorithms. Yet it is exactly that probative 
value which is central to forming diagnostic appraisals. 

What is being sought in diagnostic testing is some cohesive pattern of 
wrong answers, a pattern of individual student responses which reveals a 
characteristic signature or diagnostic profile. Diagnostic profiles are an 
integral aspect of many psychological tests: a trained examiner probes 
with increasing selectivity and specificity until a meaningful 
psychological pattern appears. In recent work in projective testing, the 
initial response of the examinee to a stimulus card is codified, the 
ensuing inquiry is guided by estimates about the psychological dimensions 
of the problem as shown by that codifying, and the examinee's responses to 
that inquiry are used to refine and solidify one or another diagnostic 
inference (McArthur & Roberts, 1982). This honing procedure proceeds in an 
adaptive sequence based in part on technical guidelines, in part on the 
examinee's consistency (or lack thereof) in responding to the stimulus, and 
in part on the examiner's inferences of the strength of the present 
evidence and the benefit of continued testing. 

Under highly idealized circumstances, the disability of a student who 
is engaging consistently in a certain misunderstanding of the test content 
would be identified early in a testing sequence by the astute observer 
(human or computer); from this point the test then could be tailored to 
estimates not only of ability (θ) but also (or perhaps instead) of the 
relative likelihoods of a set of competing diagnostic hypotheses {H1, H2, 
H3...}. Items whose distractors would assist in discriminating among the 
plausible competing H's for that student's behavior could be administered 
to the student in increasingly well-bounded subsets until one or another 
stopping rule is met. Briefly, the optimal stopping rule would be one 
which maximizes the likelihood of a single primary diagnostic hypothesis, 
supported by sufficient estimation strategies and by exactly the right 
amount of evidence. The evidence is not so much as to be unnecessarily 
redundant, and not so little as to be insufficiently discriminatory, not so 
difficult that the student simply flounders and not so easy that the 
examiner misses the problem altogether. This task is by its nature a 
compound probabilistic undertaking, although the flow chart which 
schematically illustrates this task is relatively simple (see Figure 1). 



Figure 1 

Schematic flow of a generalized response-contingent test 

The flow of a response-contingent test is governed by two implicit 
prerequisites. The first is that a finite set of suitable hypotheses is 
represented by the test. The hypotheses are appropriate to age-level, 
intellectual functioning and motor capabilities of the target student. The 
hypotheses are orderly, in the sense that they are either at a uniform 
level of abstraction and mutually independent, or they fit an explicit 
hierarchy or cascade and are mutually dependent upon one another. The 
second implicit prerequisite is that a given item or set of items be 
closely linked to at least two competing hypotheses. A response must be 
able to be evaluated in terms which tie the response to one hypothesis but 
mismatch another; the response cannot be considered probative unless these 
links can be made at the time the response is given. 

As the test is administered four decisions must be made in real time. 
1) Is the response probative? If not probative, further decisions 
regarding discrimination among competing hypotheses are obviously moot; 
questions must be asked as to the appropriateness of the item given, the item 
selection criteria, and the original hypotheses, then another item or 
item set administered. 2) Is any one of the stopping rules in use met? 
If a stopping rule applies, it signifies that the examiner has reached an 
applicable criterion, so further testing is not warranted.¹ 3) Are there 
remaining items to administer, or remaining hypotheses for which one or 
more stopping rules have not been met? If either answer is negative, there 
is nothing to be gained by further testing in the context of the present 
test. The presence or absence of one or more supported hypotheses, and the 
costs of continued testing with additional instruments, govern the 
examiner's decision at this juncture. 4) Should item selection criteria be 
changed? If no hypothesis is supported, and if a bank of items and 
hypotheses remain, a decision must be made as to whether the sequence of 
administration continues to be appropriate. Explicit branching can occur 
here: interactive tests use this decision point to change topics, item 
complexity, and/or task requirements to enhance the expected likelihood of 
hypothesis discrimination. It is this decision point which allows the 
examiner to maximize inferences in regard to diagnostic hypotheses. 

¹ This assumption holds if the examiner considers stopping rules 
disjunctive. If stopping rules are considered conjunctive, then the 
question is answered in the negative until all associated stopping rules 
are met -- with, of course, a larger volume of responses and presumably, 
though not necessarily, an increased discriminative power. 
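
The four decisions can be read as the body of a loop. The following Python sketch is one hypothetical arrangement of them; the names `run_test`, `respond`, `stop_rule` and `select_next`, the dictionary layout of an item, and the use of simple response counts as "evidence" are assumptions made for illustration and are not taken from the paper.

```python
def run_test(items, respond, stop_rule, select_next):
    """Administer items until a stopping rule is met or the bank is exhausted.

    items       -- list of dicts; item["choices"] maps an answer choice to the
                   hypothesis it is taken to indicate (hypothetical layout)
    respond     -- function(item) -> the examinee's answer choice
    stop_rule   -- function(evidence) -> True when testing should stop
    select_next -- function(remaining, evidence) -> next item to administer
    """
    remaining = list(items)
    evidence = {}                                  # hypothesis -> response count
    while remaining:                               # decision 3: items remain?
        item = select_next(remaining, evidence)    # decision 4: selection criteria
        remaining.remove(item)
        hypothesis = item["choices"].get(respond(item))
        if hypothesis is None:                     # decision 1: not probative
            continue
        evidence[hypothesis] = evidence.get(hypothesis, 0) + 1
        if stop_rule(evidence):                    # decision 2: stopping rule met?
            return max(evidence, key=evidence.get), evidence
    return None, evidence                          # no hypothesis supported


# Usage: five items, each offering one distractor per hypothesis.
bank = [{"name": f"item{i}", "choices": {"a": "H1", "b": "H2"}} for i in range(5)]
diagnosis, counts = run_test(
    bank,
    respond=lambda item: "a",                      # simulated consistent error
    stop_rule=lambda ev: max(ev.values()) >= 3,    # assumed criterion
    select_next=lambda remaining, ev: remaining[0],
)
print(diagnosis, counts)
```

Passing the stopping rule and the item selector as functions keeps the loop itself neutral, so that any of the probabilistic rules discussed later could in principle be substituted for the simple count used here.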

With very rare and specialized exceptions, diagnostic testing in 
education seldom enables the test interpreter to build on inferential 
strategies with respect to individual test performance. Moreover, a 
variety of theoretical and practical problems appear to have plagued 
developments along this line. Among the problems that arise in the pursuit of 
interpretable patterns is the difficulty in obtaining diagnostic 
performance clusters from raw data without a prior set of likelihood 
estimates for a small and workable number of competing hypotheses.² 

² The number of ways in which n observations can be grouped into m 
clusters is a Stirling number (of the second kind): 

$$ S(n, m) = \frac{1}{m!} \sum_{j=0}^{m} (-1)^{m-j} \binom{m}{j} j^{n} $$

Unfortunately, even for a handful of observations, this term can be 
exceedingly large. 

The problem of assigning meaning to cohesive patterns of response 
reduces in its simplest form to two elements: limiting the number of 
observations we need to take, and limiting the number of possible 
meaningful clusters into which we will place observations. A test ought to 
provide enough range to evaluate fairly a highly varied set of possible 
examinees, without building such a long test that any of the protagonists 
-- examinee, test administrator, test interpreter, or test designer -- is 
exhausted by the process. The diagnostic process ought to involve checking 
in real time as to whether any payoff remains for administering more test 
items. Is the performance of the examinee at this moment in time 
sufficient in quantity and "cohesion" for us to draw a suitable diagnostic 
inference? 
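
The growth of the cluster count mentioned in the note on Stirling numbers can be checked numerically. The sketch below uses the standard recurrence for Stirling numbers of the second kind; the function name `stirling2` is a label chosen here, not one used by the authors.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def stirling2(n: int, m: int) -> int:
    """Number of ways to group n observations into m non-empty clusters."""
    if n == m:
        return 1                      # includes the empty case S(0, 0) = 1
    if m == 0 or m > n:
        return 0
    # The n-th observation either joins one of the m existing clusters
    # or forms a new cluster on its own.
    return m * stirling2(n - 1, m) + stirling2(n - 1, m - 1)


# Usage: cluster counts for four clusters as the number of observations grows.
for n_obs in (5, 10, 20):
    print(n_obs, stirling2(n_obs, 4))
```

Even with the number of clusters held at four, the count rises steeply with each added observation, which is the practical reason for restricting both observations and hypotheses.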

A simple stopping rule for diagnosis takes the following form: go no 
further because any one of several probabilistic boundaries is met. Among 
the set of allowable hypotheses, one diagnostic hypothesis has emerged in 
the "lead." One possibility for limiting observations and limiting 
clusters simultaneously is to avoid that Stirling number by picking an easy 
criterion, a low threshold of confidence, and a small number of allowable 
hypotheses. Alternatively, we can limit the number of possible clusters 
for diagnosis to exactly two, so students must select one option or the 
other; the stopping rule becomes: go no further when one hypothesis 
obtains a simple majority of examinee responses. 

Foremost among the difficulties of using the stopping rule approach to 
limit observations and clusters is the extreme paucity of situations in 
educational or psychological testing for which a strict parsimony of 
hypotheses can be formulated. Another is the reasonable assurance that 
some students will guess some of the time on some items. Yet another is 
the degree of confidence one places in a single response as a marker of a 
general pattern of responses; a test item, after all, is seldom adequate as 
a mirror of a student's true understanding. Other difficulties arise in 
regard to assessment of the several probabilities that contribute to the 
flow of the test: they include problems with probabilistic comparison 
baserates, fuzziness in the Bayesian priors, and inherent objections to 
traditional Bayesian probabilistic analysis itself. 

None of the problems stated here is insurmountable. Theoretically 
useful probabilistic algorithms for diagnostic inference are found in 
several professions. This paper sets out six algorithms which have bearing 
on the interpretation of response patterns and diagnosis. Two are drawn 
from probabilistic methods in educational testing -- Sixtl's modified 
binomial and Choppin's catenating Bayesian methods; two are drawn from 
recent developments in medical diagnostic studies -- Fink and Galen's 
decision path analysis and Shortliffe's inexact reasoning; one rests in 
decision theoretic analysis -- Kmietowicz's ranked probabilities; and one 
builds on a Baconian probabilistic appraisal -- Schum's "cascaded 
inference," which has been studied primarily in the context of decision 
making in jurisprudence. Each of the six methods will be placed into a 
common notation, and a comparison made between the advantages and 
disadvantages of each, with special attention to the restrictive nature of 
prerequisites and the relative strength of the stopping rules. Not all of 
these approaches are equivalent in scope, nor do they have analogous 
assumptions about the patterns which are being isolated from the raw data. 
It is also important to note at the outset that the stronger inferential 
procedures inevitably impose more restrictive conditions on the user. 

ESSENTIALS FOR PROBABILISTIC EVALUATION OF DIAGNOSES 

In the present discussion, the following terms are used throughout: 

{h}     the set of alternative hypotheses H1, H2, H3...Hm (includes 
        "correct") 

Hi      the i-th hypothesis of {h}, contained in one or more 
        alternatives for one or more items 

P(Hi)   the prior probability of Hi 

xi      the examinee's response at a given step s in the testing 
        sequence (generally a response to a single item which 
        represents a choice of Hi from the set of {h}) 

Hj      those hypotheses shown to the examinee in an item but not 
        selected 

Hl      those hypotheses not shown to the examinee in the item, so not 
        selectable at this step 

k       the number of hypotheses contained in an item's answer choices 

m       the number of hypotheses in all (m ≥ k) 

n       the number of attempts made by the examinee (n ≥ s) 

Diagnostic testing involves several key terms made up of the above 
entries. The general form of the stopping rule is the following: 
At a given step s in the sequence of the test, does the accumulated 
evidence which suggests Hi exceed the accumulated evidence {x} 
which relates instead to Hj and Hl? The accumulation of evidence on 
both sides is treated probabilistically, and the likelihood ratio that 
results from dividing one into the other is assessed against an allowable 
lower and upper limit. Wald's (1947) sequential probability ratio test 
(SPRT) is the earliest treatment of this stopping question: 



n 
A number of studies have applied Wald's (1947) sequential probability 
ratio test to the task of test individualization (cf. Ferguson, 1969, 
1973). The SPRT, predicated on Bayesian methodology, is well understood 
but clearly does not begin to account for the variety of factors which 
contribute to examinee performance. 
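
As a concrete illustration of the stopping question, the sketch below implements the textbook form of Wald's test for two hypotheses and right/wrong-style responses: the log likelihood ratio is accumulated response by response and compared with a lower and an upper limit derived from error rates alpha and beta. The two-hypothesis restriction, the 0/1 coding, and the parameter names `p_hi` and `p_hj` are simplifying assumptions for the example, not the authors' notation.

```python
import math


def sprt(responses, p_hi, p_hj, alpha=0.05, beta=0.10):
    """Sequential probability ratio test on 0/1 responses.

    responses -- iterable of 1 (response fits the pattern expected under Hi) or 0
    p_hi      -- assumed probability of a 1 if Hi is true
    p_hj      -- assumed probability of a 1 if Hj is true
    Returns (decision, n): decision is "Hi", "Hj", or None (keep testing).
    """
    upper = math.log((1 - beta) / alpha)     # Wald's approximate upper limit
    lower = math.log(beta / (1 - alpha))     # Wald's approximate lower limit
    log_ratio = 0.0
    n = 0
    for n, x in enumerate(responses, start=1):
        if x:
            log_ratio += math.log(p_hi / p_hj)
        else:
            log_ratio += math.log((1 - p_hi) / (1 - p_hj))
        if log_ratio >= upper:
            return "Hi", n                   # enough evidence for Hi
        if log_ratio <= lower:
            return "Hj", n                   # enough evidence for Hj
    return None, n                           # neither limit reached


# Usage: a run of responses that all fit Hi.
print(sprt([1, 1, 1, 1], p_hi=0.8, p_hj=0.3))
```

Working on the log scale turns the product of ratios into a running sum, which avoids numerical underflow on long tests.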

Gorry and Barnett (1968) showed that sequential diagnostic testing 
involves a compounding of conditional probabilities as follows: 



where ∩ implies a logical "and" using the entire set of behaviors 
acquired to date {X} as evidence. To be successful, these approaches 
require extensive knowledge of prior conditionals and interrelationships. 
They are fairly impractical except in highly controlled environments. The 
various techniques which follow are to be viewed as approximations of these 
data-intensive methods. 
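
To make the data requirements concrete, the following sketch applies Bayes' rule to the whole set of responses acquired to date. It assumes that responses are conditionally independent given each hypothesis, so that the joint probability of the evidence is a simple product; that assumption, the function name `posterior_over_hypotheses`, and the table layout are introduced here for illustration and are not claimed to be Gorry and Barnett's formulation.

```python
def posterior_over_hypotheses(priors, likelihoods, observed):
    """Posterior p(H | all observed responses) under conditional independence.

    priors      -- dict: hypothesis -> p(H)
    likelihoods -- dict: hypothesis -> dict: response -> p(response | H)
    observed    -- list of responses acquired to date
    """
    joint = {}
    for hypothesis, prior in priors.items():
        p = prior
        for x in observed:
            p *= likelihoods[hypothesis][x]   # compound the conditionals
        joint[hypothesis] = p
    total = sum(joint.values())
    if total == 0:
        raise ValueError("observed responses are impossible under every hypothesis")
    return {h: p / total for h, p in joint.items()}


# Usage: two hypotheses, two answer choices, two responses so far.
priors = {"H1": 0.5, "H2": 0.5}
likelihoods = {"H1": {"a": 0.8, "b": 0.2}, "H2": {"a": 0.4, "b": 0.6}}
print(posterior_over_hypotheses(priors, likelihoods, ["a", "a"]))
```

Every entry of `likelihoods` has to be known in advance for every item and hypothesis, which is the practical burden the text refers to.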

TECHNIQUES FROM EDUCATIONAL RESEARCH 

Modified Binomial Method 

A decade ago a German educational researcher published a paper on 
automated test administration which included an application of Bayesian 
probabilistic analysis with correction for guessing. Sixtl (1974) 
presented a formula for a stopping rule which acknowledges the roles of 
item answer alternatives, and is readily identified as a classical Bayesian 
approach to forming a likelihood function for a particular hypothesis. 
Sixtl's likelihood ratio, when modified to cover each diagnostic hypothesis 
of a set of hypotheses, reads as follows: 



k 



Sixtl's approach involves selection of a Bayesian prior for each 
diagnostic hypothesis, invoking a simple correction for guessing, and 
construction in real time of the likelihood function Λ for the hypothesis 
to fit the stopping rule 



$$ \frac{\beta}{1 - \alpha} \;<\; \Lambda \;<\; \frac{1 - \beta}{\alpha} $$



where α and β are conventional measures of significance and power, 
respectively. Figure 2 presents a schematic illustration of this method. 



Figure 2 

Sequential testing - binomial model, 
multiple hypotheses in {h} 




Immediate objections can be made to Sixtl's approach. First, the 
model of guessing is simplistic; it allows only a constant term for a 
function that is unlikely to be stable across items and respondents. 
Second, the Bayesian priors are fixed; they must be chosen to reflect 
{h} without regard to context or sequence effects. Third, Sixtl's 
procedure fails to use all of the information gained at a given moment to 
form an updated chain of hypothesis evaluation. 

Catenating Bayesian Method 

A sequential system for response-contingent diagnostic testing was 
proposed by Choppin (McArthur & Choppin, 1983) using both Bayesian priors 
and conditionals to form continuously updated probability assessments for 
diagnostic hypotheses in real time. Choppin's approach, modified slightly 
to reflect iterative cycles through {h}, is: 



^1 rC<x 









The computation is predicated on a catenating sequencing of conditionals: 
initially it requires priors for xi and xj assuming Hi is true. Each 
is updated upon the examinee's next selection, such that Choppin's F in 
Bayesian terms is a catenating conditional ratio appraisal. Use of the 
Shannon entropy function (S.E.F.), which in this context is 

$$ \mathrm{S.E.F.} = -\sum_{i=1}^{m} p(H_i) \log p(H_i) $$

simplifies the output of the catenating method by concluding at each step 
with a single expression for the uncertainty remaining in the set of 
proportions. The largest decline in S.E.F. denotes an optimal stopping point 
for the sequence; the largest p(H) at that step is taken as an optimal H 
for that respondent. Figure 3 shows this procedure at work. 
Figure 3 

Sequence of testing - catenated Bayesian model, 
multiple hypotheses in {h} 
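
The entropy-based stopping idea can be sketched as follows. The code takes the hypothesis probabilities available after each response, computes the Shannon entropy at every step, and reports the step with the largest decline together with the leading hypothesis at that step. The base-2 logarithm, the function names, and the assumption that the whole sequence of probabilities is already on hand are choices made for this example; in real-time use the largest decline could only be judged against the steps seen so far.

```python
import math


def shannon_entropy(probabilities):
    """Uncertainty remaining in a set of proportions (in bits)."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def largest_entropy_decline(posteriors_by_step):
    """Locate the step whose entropy dropped most from the preceding step.

    posteriors_by_step -- list of dicts (hypothesis -> probability), where
                          element 0 holds the priors before any response
    Returns (step, leading hypothesis at that step, list of entropies).
    """
    entropies = [shannon_entropy(p.values()) for p in posteriors_by_step]
    declines = [entropies[s - 1] - entropies[s] for s in range(1, len(entropies))]
    step = declines.index(max(declines)) + 1
    leader = max(posteriors_by_step[step], key=posteriors_by_step[step].get)
    return step, leader, entropies


# Usage: hypothetical probabilities before testing and after three responses.
history = [
    {"H1": 0.50, "H2": 0.50},
    {"H1": 0.60, "H2": 0.40},
    {"H1": 0.90, "H2": 0.10},
    {"H1": 0.92, "H2": 0.08},
]
print(largest_entropy_decline(history))
```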




An inclusive conditional probability F(xi|Hj) represents the 
probability that selection i would be made when Hj is true, a selection 
which could be made for a wide variety of reasons. One immediate objection 
to Choppin's F is that the catenation is sensitive to the choice of the 
separate initial priors. A subtle but potentially damaging argument is 
also to be found in the catenation and recomputation of conditionals under 
Hl, when an alternative hypothesis is not represented among the 
item distractors. Unfortunately, on both counts it is the Bayesian system 
of probabilistic assessment itself which forces this to occur. 

TECHNIQUES FOR DIAGNOSTIC RESEARCH FROM OTHER AREAS 

Decision Path Method 

At any point beyond the entry point in a sequential test, an 
additional set of conditional probabilities which are potentially important 
is required. Not only are there conditional interrelationships among the 
{h} and {x}, but also among the paths which led up to the particular step 
in the test sequence and the actions taken by the examiner at each step. 
An applied extension of Bayesian analysis to decision paths is found in the 
field of research in diagnostic medicine. The decision tree analysis 
illustrated by Fink and Galen (1981) invokes a Bayesian framework operating 
with compound conditionals: 

where {h} = the hypotheses allowable within a given situation, {a} = the 
actions to be taken within the situation, Path = the path from preceding 
selections which led to the current condition, and the result of 
selecting a particular action. This result leads to further data which 
then allows refinement of the probability estimate for Hi. The multiple 
conditioning terms lead to extensively annotated decision trees, for which 
information is available about the relative values of selecting one option 
over another in terms which include the sequence of those options, their 
cost, their efficiency, and their measurement certainty. Figure 4 
illustrates the calculation for the decision tree method schematically. 



Figure 4 

Sequence of testing - decision tree model, 
multiple hypotheses in {h} 




An obvious implication of Bayesian path analysis is that, when given 
fully elaborated baserates, a researcher can construct a fully elaborated 
decision tree which includes each possible diagnosis, all possible 
interactions among diagnoses, and cost-efficiency assessments. An obvious 
hitch in applying the system to educational testing is the profound lack of 
reliable baserate data for all but the least complex diagnostic hypotheses 
likely to be explored. Additionally, distinctions between various paths 
may be far less profound in the context of educational testing than in 
diagnostic medicine. 

Inexact Reasoning Method 

In a variety of settings, evidence about prior probabilities is 
relatively limited. If the priors can be estimated, we can draw on a 
system for hypothesis evaluation called the method of inexact reasoning, 
which accounts for the lack of exactitude in the establishment of priors. 
It was developed by Shortliffe and Buchanan (1975) in the context of the 
well-known MYCIN automated medical diagnosis program. Its prime concern is 
with the strength of evidence, rather than a perfect match between evidence 
or behavior and hypotheses. Three separate terms are required: one a 
measure of belief and another a measure of disbelief, expressed as 
conditional statements, plus a term which reflects the difference between 
belief and disbelief: 

$$ MB[H_i, x] = \frac{\max[p(H_i \mid x),\, p(H_i)] - p(H_i)}{\max[1, 0] - p(H_i)} $$

$$ MD[H_i, x] = \frac{\min[p(H_i \mid x),\, p(H_i)] - p(H_i)}{\min[1, 0] - p(H_i)} $$

$$ CF[H_i, x] = MB[H_i, x] - MD[H_i, x] $$

In many ways the notation appears to more closely reflect the psychological 
mindsets and inductive decision processes used by practicing clinicians than 
the preceding methods, which are formally more exact (an important point 
discussed later in this paper). 
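
The three terms are straightforward to compute once a prior p(Hi) and an updated estimate p(Hi|x) are in hand. The sketch below follows the definitions of the measure of belief (MB), measure of disbelief (MD), and certainty factor (CF) given by Shortliffe and Buchanan (1975); the function names and the numeric values in the usage lines are illustrative assumptions.

```python
def measure_of_belief(prior, posterior):
    """MB: proportional increase in belief in the hypothesis given the evidence."""
    if prior == 1.0:
        return 1.0
    return (max(posterior, prior) - prior) / (1.0 - prior)


def measure_of_disbelief(prior, posterior):
    """MD: proportional decrease in belief in the hypothesis given the evidence."""
    if prior == 0.0:
        return 1.0
    return (min(posterior, prior) - prior) / (0.0 - prior)


def certainty_factor(prior, posterior):
    """CF: difference between belief and disbelief, ranging from -1 to 1."""
    return measure_of_belief(prior, posterior) - measure_of_disbelief(prior, posterior)


# Usage: a response that raises the estimate for a hypothesis, then one that lowers it.
print(certainty_factor(prior=0.2, posterior=0.6))
print(certainty_factor(prior=0.5, posterior=0.25))
```

A single piece of evidence moves only one of MB or MD away from zero, so the sign of CF shows directly whether the response counted for or against the hypothesis.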

Originally this model was put forward as a system of approximating 
conditional probabilities, suitable for circumstances characterized by data 
which tend to be subjective rather than objective. Since very few of the 
natural sciences have exact data in the strict sense required by Bayesian 
conditionals, the reasonableness of pursuing approximations seems assured. 
Moreover, many outcomes of a decision process are not even at the same 
level of rough granularity as the data used in that process; that is, the 
number of remedial options available to an examiner is smaller than the 
number of diagnostic clusters for performance of an examinee. Thus an 
approximation, if adequate, can provide completely sufficient guidance to 
the examiner for the purposes at hand. At minimum the approximation should 
provide a basis for corroborating human judgments of logical premises, 
actions, and consequences. Indeed, the model was incorporated into a 
highly regarded artificial intelligence approach to medical decision making 
which itself has seen extensive development and generalization. Figure 5 
illustrates this approach at work. 
Figure 5 

Sequence of testing - inexact reasoning model, 
multiple hypotheses in {h} 




Problems with the approach are unavoidable. Adams (1976) elaborated a 
series of theoretical objections which focus on the direct relations -- not 
immediately obvious -- between MB, MD, CF, and conventional Bayesian 
solutions to [illegible] adjusted probabilities. Again, because of Bayesian 
logic, one can [illegible] arrive by computation at untenably small conditional 
probabilities even when intuitive logic suggests otherwise. The strongest 
theoretical failing lies in the assumption of independence of {h}; any 
interdependence goes unaccounted in MB, MD, CF. As CF constitutes a 
weighting factor its role in practical applications of MB and MD is 
multiplicative, but, Adams claims, "not true in general." 

The fact that in trying to create an alternative to 
probability theory or reasoning Shortliffe and 
Buchanan duplicated the use of standard theory 
demonstrates the difficulty of creating a useful and 
internally consistent system which is not isomorphic 
to a portion of probability theory (p. 185). 

Ranked Probability Method 

At the lowest end of the spectrum in terms of conditional complexity 
is a method which requires no more than weakly ordered priors of the form 
p(Hi) > p(Hi+1). In a variety of settings the researcher labors with 
unknown (and potentially unknowable) data about which only a minimum degree 
of information can be stated with confidence. In such conditions of 
incomplete knowledge, Kmietowicz and Pearson (1981) have spelled out a 
decision theory, and Horbar (1983) has illustrated an application to 
medical diagnosis. Using Horbar's approach, we can state the following: 



From a series of tables, generated by a procedure involving random sets of 
priors and conditionals, the user determines the probability that a given 
ordering of {h} shown by the examinee's responses reflects an expected 
ordering. For example, the order H3>H2>H1 has a substantially smaller 
posterior probability in reference to the expected sequence H1>H2>H3 than 
does the order H2>H1>H3. Figure 6 illustrates this approach at work in a 
hypothetical testing situation. 

Figure 6 

Sequence of testing - ranked probability model, 
ranking of multiple hypotheses 
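
The flavor of a table built from random sets of priors can be conveyed with a toy simulation. The sketch below is not Horbar's procedure and does not reproduce his tables; it simply draws random priors that respect the expected ranking, simulates a short run of responses, and tallies how often each observed ordering of the hypotheses turns up. The sampling scheme, the tie-breaking rule, and all names are assumptions made for illustration.

```python
import random
from collections import Counter


def ordering_frequencies(n_hyp=3, n_responses=12, trials=20000, seed=0):
    """Estimate how often each observed ordering arises under ranked priors."""
    rng = random.Random(seed)
    seen = Counter()
    for _ in range(trials):
        # Random priors sorted so that p(H1) > p(H2) > ... (index 0 is H1).
        raw = sorted((rng.random() for _ in range(n_hyp)), reverse=True)
        total = sum(raw)
        priors = [r / total for r in raw]
        # Each simulated response selects one hypothesis according to the priors.
        picks = rng.choices(range(n_hyp), weights=priors, k=n_responses)
        counts = Counter(picks)
        # Observed ordering by response count; ties keep the expected order.
        order = tuple(sorted(range(n_hyp), key=lambda h: -counts[h]))
        seen[order] += 1
    return {order: c / trials for order, c in seen.items()}


# Usage: compare the expected order, a single swap, and a full reversal.
freq = ordering_frequencies(trials=5000, seed=1)
for order in [(0, 1, 2), (1, 0, 2), (2, 1, 0)]:
    print(order, freq.get(order, 0.0))
```

In this toy setting the fully reversed ordering is the rarest of the three, in keeping with the example in the text, although the exact figures depend entirely on the assumed sampling scheme.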

At its heart the ranked probability method is a Bayesian procedure 
with a loosening of terms. It assumes that the examiner can reasonably 
generate an expected sequence for {h}. It also assumes that the elements of 
{h} are mutually exclusive. Of concern here is that the tables themselves 
may be contingent in important ways on the original procedure which 
produced them (Horbar, personal communications). Additionally, no account 
is made of reliability of the evidence or of guessing behaviors and other 
nonrandom choices by the examinee. A great deal of work has been produced 
in the area of decision theory, but extensions of such methods to 
situations involving incomplete knowledge are very scarce. An alternative 
system which addresses incompleteness mathematically is found in a terse 
monograph by Vesely and Vajda (1971). Further developments are essential. 

Cascaded Inference Method 

In situations with conflicting evidence such as are likely to be 
generated by a diagnostic test, it would be exceedingly helpful to have a 
system of analysis which takes account of the conflict and in particular 
the degree to which a given item response x relates to discrimination 
among the set of diagnostic hypotheses {h}. 

For obvious reasons, the problem of developing conclusions from bits 
of evidence -- some corroborating, others contradictory, some useful, others 
useless, some fresh, others redundant -- has been of interest to 
researchers in jurisprudence. In the typical setting, a jury faces multiple and 
deliberately conflicting sources of information -- testimony by witnesses 
for the prosecution and the defense, documentation, photography, statements 
by court and counsel -- and must develop a collective judgment as to a binary 
{h} consisting of "guilty, not guilty." The Bayesian system of mathematical 
logic collapses under the demands of inferential reasoning required 
here; for example: 

...testimony requires [a jury member] to assess the 
likelihood that the defendant was, in fact, at the 
scene/time. This foundational stage involves 
evaluation of the witness's credibility. Then, assuming the 
defendant at the scene/time of the crime, one must 
assess how strongly this event bears on the issue of 
whether or not the defendant committed the crime. 
Further difficulty is presented by intricate patterns 
of reasoning which require the joint consideration of 
current evidence with one or more previously given 
piece of evidence (Schum & Martin, 1982, p. 106). 

Schum draws on a Baconian approach to inductive reasoning explicated 
by L. Jonathan Cohen (1977, 1982) which allows direct estimations of 
probabilities for inference structures. Inference structures, which are 
found in all forms of human reasoning, run as follows: 

I have an assertion about x, which I read with some 
degree of skepticism, and which I take as a reflection 
on facts or events which in time I combine to assess 
the "major or ultimate facts-at-issue." 

In a jury setting, a witness gives testimony about the crime that 
occurred. It may consist of an event which can be linked directly to guilt 
or innocence of the defendant ("I saw him rob the lady"). Such first-order 
relations of x to {h} are remarkable because they are so rare. Witness 
testimony is more often of a fact that may or may not be interpreted as 
probative of events which may or may not be linked incompletely to guilt or 
innocence ("I heard a scream and saw someone running"). These compound 
cascades of inference to facts-at-issue are represented by extensions of 
the likelihood ratio 



A 



where {x} is the set of evidence accrued to date, {x+} is the portion of 
that evidence in corroboration with the testimony to date, {x-} is the 
portion of that evidence in contradiction to the testimony to date. 

The cascaded likelihood ratio has interesting properties, notably the 
use of terms which speak to the contribution of x to the set of evidence 
{x} pointing to Hi. The first of these resides in the term which 
contrasts the relation of x to (x+, Hi) minus the relation of x to 
(x-, Hi), in comparison to the relation of x to (x+, Hj) minus the relation 
of x to (x-, Hj). These elements relate the degree of specificity of x 
to Hi. The selectivity of x to {x-, (Hi or Hj)} is an expression 
of how distinctly x discriminates itself from the portion of {x} which is 
contradicted by x. In more familiar language, the combinations formed 
here address true positives, false positives, true negatives, and false 
negatives. 
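
The way source reliability enters a cascaded likelihood ratio can be shown with the simplest two-stage case: a report (a testimony, or a scored response) about an underlying event, where the event in turn bears on two hypotheses. The sketch below uses the conventional probabilistic form of that case, not the Baconian extension the authors describe, and assumes the report depends on the hypotheses only through the event. The parameter names `hit_rate` and `false_alarm_rate` are labels chosen here.

```python
def cascaded_likelihood_ratio(p_event_hi, p_event_hj, hit_rate, false_alarm_rate):
    """Likelihood ratio of a report about an event, for Hi against Hj.

    p_event_hi       -- p(event | Hi)
    p_event_hj       -- p(event | Hj)
    hit_rate         -- p(report | event occurred), a true positive rate
    false_alarm_rate -- p(report | event did not occur), a false positive rate
    """
    spread = hit_rate - false_alarm_rate
    # p(report | H) = p(event | H) * hit_rate + (1 - p(event | H)) * false_alarm_rate
    numerator = p_event_hi * spread + false_alarm_rate
    denominator = p_event_hj * spread + false_alarm_rate
    return numerator / denominator


# Usage: the same event evidence seen through sources of differing reliability.
print(cascaded_likelihood_ratio(0.8, 0.2, hit_rate=1.0, false_alarm_rate=0.0))
print(cascaded_likelihood_ratio(0.8, 0.2, hit_rate=0.9, false_alarm_rate=0.1))
print(cascaded_likelihood_ratio(0.8, 0.2, hit_rate=0.5, false_alarm_rate=0.5))
```

A perfectly reliable source passes the event's own likelihood ratio through unchanged, while a source whose reports are unrelated to the event yields a ratio of one and so carries no probative weight.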

Figure 7 is a demonstration of cascaded inference at work. In the 
figure, the individual elements of evidence are annotated by a subscript 
indicating when the evidence is acquired. The full complexity of the 
calculations is not shown. 

Figure 7 

Sequence of testing - cascaded inference model, 
multiple hypotheses in {h} 

The analogy of cascaded inference in jurisprudence to cascaded 
inference in diagnostic testing can be seen in the following example. A 
student behaves in a fashion scored by x, which while not to be taken as 
a direct assertion of Hi, is seen as a symptom of Hi and thus an 
element of confirmation to {x+, Hi}, and an element in disconfirmation to 
{x-, Hj}. A student's erroneous response to a math test item is scored as 
symptomatic of a logical misunderstanding of how to carry digits in 
two-digit subtraction: the examiner includes this response in forming an 
overview of that student's pattern of responses across the test, but weights 
this response by 

- the degree to which the response is probative for Hi 

- the degree to which the response is consistent with {x+} and {x+, Hi} 

- the degree to which the response is contradictory to {x+} and {x+, Hi} 

Because of Baconian probability techniques, a relatively low-frequency 
response may contribute effectively to discriminating among {H}, and to 
directing the examiner to choosing a suitable item which may also have low 
(though nonzero) hypothesis likelihoods. 

The underlying logic of the cascaded inference model and its 
explication of inference structures appear to closely resemble the logic 
and inference structures used by juries. By extrapolation, the same logic 
and inference structures describe the task of an educational or 
psychological diagnostician. At the present time, however, the cascaded 
inference model has not been tried with educational test data. 
Comparison of techniques 

The six analytic schemes presented thus far are chosen to reflect a 
series of contrasts in assumptions, prerequisites, processes, and 
outcomes. The sequence portrayed above is an attempt to let each new 
method address the failings of the method that preceded it. To begin, Sixtl's 
modification of the sequential probability ratio test shows an accounting 
for the probability of responding by chance. Choppin's catenating 
technique incorporates conditional probabilities beyond the single p(Hi) 
used by Sixtl; these allow one to chain together the evidence of p(Hi,Hj,Hk). 
Fink and Galen's decision tree moves from unconditional priors to compound 
conditional priors of the form p(x|A,B,...) where A,B,... represent 
elements of the context surrounding the observation x, that is, what 
path was used to arrive at this observation, what action was taken, and so 
forth. The ranking method of Kmietowicz, put into practical terms by 
Horbar, is in theory a relaxation of requirements; where the decision tree 
method requires a great deal of hard evidence, the ranking method can make 
use of knowledge about unconditional prior probabilities that is much less 
complete. The inexact reasoning method of Shortliffe and Buchanan attempts 
to portray both unconditional prior and conditional estimates of probability 
in a system that also loosens the need for exact or strictly ordered data. 
The cascaded inference method, developed from the work of Cohen by Schum 
and colleagues, attempts to correct the restrictions of Bayesian 
probabilistic reasoning, to allow hierarchical and nested hypothesis 
evaluation. 

As noted at the outset, the various techniques differ markedly in 
their requisite assumptions and scope. Figure 8 presents a listing of 
considerations as to further assumptions of these methods. Looking solely 
at the probability prerequisites for each method, we find that they differ 
markedly in their treatment of priors and conditionals. The modified 
binomial method relies on a single prior and not at all on conditionals. 
The catenating method relies on three separate priors and not at all on 
conditionals. The decision tree method relies on a compound conditional but 
not at all on unconditional priors. The ranking method starts from a 
weakly-ordered set of priors to estimate conditionals. The cascaded 
inference method utilizes both priors and conditionals. 

One important assumption concerns the independence of hypotheses: 
are members of the set {H} mutually exclusive or can they overlap? Along 
the same lines, are observations x of the set of evidence {x} allowed to 
be partially or completely redundant, or must each observation be treated 
uniquely? The process by which each method proceeds is Bayesian with the 
notable exception of cascaded inference. (Further research is required as 
to how Baconian techniques may be brought to bear on the operation of the 
first five methods otherwise unmodified.) At present, none of the methods 
handles the possibility of both unreliable data and unreliable behavior on 
the part of the examiner. 

Figure 8 

Comparison of probabilistic techniques: 
prerequisites, processes, and outcomes 

            SPRT   Modified   Catenating   Decision   Ranking      Inexact      Cascaded 
                   Binomial                Tree                    Reasoning    Inference 

Reference:  Wald   Sixtl      Choppin      Williams   Kmietowicz   Shortliffe   Schum 

What is most interesting from the point of view of diagnosis is how 
each method enables one to evaluate the probative value of each piece of 
evidence - that is, what term or expression (or change in terms or 
expressions) occurs at each step in the testing process such that the 
examiner sees how the last observation acquired has affected the 
relative standing of competing hypotheses. The output of the modified 
binomial technique is a set of likelihood coefficients Λi, one for each 
hypothesis. The output of the catenating Bayesian method is a set of 
probability estimates p(H), one for each hypothesis, and a single term 
expressing the degree of uncertainty about their relative standings, S.E.F. 
The decision tree method outputs a probability estimate for every branch of 
the tree, allowing each hypothesis to be evaluated in context. The ranking 
method outputs a simple probability estimate for the entire set of 
hypotheses, which shows the probability that the given ranking 
reflects the initial estimate of ranking of competing hypotheses. The 
inexact reasoning method outputs three separate terms per hypothesis at 
each step of the testing process; the last of these terms, CF, 
expresses the certainty with which the examiner can accept each 
hypothesis. The cascaded inference method outputs a likelihood coefficient 
for the hypotheses taken singly and taken jointly. 
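
The uncertainty term reported with the catenating method is footnoted in Figure 9 as a Shannon entropy function. The paper's formula (4) is not reproduced in this section, so the sketch below rests on one assumption: that the entropy is taken with base-10 logarithms over the current probability estimates. Under that assumption the starting estimates (.6, .2, .2) give a value that rounds to 0.41, which agrees with the first S.E.F. entry in Figure 9. The function name is hypothetical.

```python
import math


def entropy_base10(probs):
    """Shannon entropy of a discrete distribution, using base-10 logarithms.

    Zero probabilities contribute nothing to the sum and are skipped.
    """
    return -sum(p * math.log10(p) for p in probs if p > 0)


initial = [0.6, 0.2, 0.2]          # starting estimates for H1, H2, H3
print(round(entropy_base10(initial), 2))   # 0.41
```

The value is largest when the three hypotheses are equally likely and falls toward zero as one hypothesis comes to dominate, which is why it can serve as a single summary of how unsettled the relative standings are.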

Four of the six methods are shown in Figure 9 as they step through a 
simulated testing session with very restrictive assumptions. For comparison, 
the results of the SPRT method are also shown. The examinee is 
presented only three choices for each of ten items; choice xi is a 
reflection of hypothesis Hi, without guessing. A simulated testing 
session is used for which the examinee begins and ends with errors of type 
1, but touches on other error types as well during the middle of the 
testing sequence; the examinee's response sequence is {1,2,3,3,2,1,2,1,1,1}. 
Initial values were set at .6 for p(xi) -> Hi, and .2 for p(xi) -> Hj; 
α was set at .25, β at .10. For illustrative purposes, computations are 





Figure 9 

Comparison of stopping using simulated data: 

Step                              1        2        3        4        5        6        7        8        9       10 
Response by examinee x            1        2        3        3        2        1        2        1        1        1 

SPRT (1)              H1       3.00     1.50      .75      .37      .19 
                      H2        .50     1.50      .75     2.25     6.75 
                      H3        .50      .25      .75     2.25 [illeg.] 

Modified binomial (2) H1       2.20     1.76     1.41     1.13     0.90     1.98     1.59 
                      H2       0.80     1.76     1.41     1.12     2.48     1.98     4.36 
                      H3       0.80     1.76     1.41     3.10     2.48     1.98 [illeg.] 

Catenating (3)        H1       0.60     0.43     0.33     0.20     0.14     0.33     0.20     0.69     0.87 [illeg.] 
                      H2       0.20     0.43     0.33     0.20 [illeg.]     0.33     0.60     0.23 [illeg.] 
                      H3       0.20     0.14     0.33     0.68     0.43     0.33     0.20     0.07     0.03 
S.E.F. (4)                     0.41     0.43     0.48     0.47     0.43     0.48     0.43     0.34     0.20 

Inexact reasoning (5) H1        .41      .01     -.23     -.37     -.48     -.22     -.27     -.23     -.05        0 
  CF                  H2       -.40     -.40      .01     -.23     -.37     -.13     -.22     -.08     -.17     -.18 
                      H3       -.40     -.64      .23     -.01     -.13     -.27     -.27     -.31     -.33     -.34 

Ranking (6)           Weak  n/a (7)      n/a      n/a     0.20     0.15      n/a     0.39     1.00 
                    Strong      n/a      n/a      n/a     0.18     0.18      n/a     0.15     0.39     1.00 [illeg.] 

(1) Wald (1947). See formula (1). 

(2) Sixtl (1974). See formula (2). 

(3) Choppin in McArthur and Choppin (1983). See formula (3). 

(4) Shannon entropy function (Gleser & Collen, 197?). See formula (4). 

(5) Shortliffe and Buchanan (1975). See formula (5). 

(6) Kmietowicz & Pearson (1981); Horbar (1983). See formula (6). 

(7) Not appropriate to calculate at this step. 
shown for the unmodified sequential probability ratio test, which concludes 
at step 5 with support for H2. The modified binomial method concludes at 
step 7 with support for H2. The catenating Bayesian method concludes at 
step 10 with support for H1. The ranking method (shown using an 
estimated order of H1>H2>H3) concludes at step 8 if the ranking is 
assumed to be weak, step 10 if strong. The inexact reasoning method fails 
to conclude by step 10. (Because the remaining two methods, decision path 
and cascaded inference, require many further initial assumptions, they are 
not included in this illustration.) 
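
The mechanics of a sequential stopping rule of this kind can be sketched briefly. The sketch below is illustrative only and rests on two assumptions that the text does not state outright: that the values .25 and .10 given with the initial settings are Wald's error rates α and β, and that each response multiplies the running ratio for a hypothesis by 3 when it matches that hypothesis and by 0.5 when it does not (factors inferred from the H1 row of Figure 9). All function names are hypothetical.

```python
def wald_thresholds(alpha, beta):
    """Wald's approximate SPRT bounds: accept at or above upper, reject at or below lower."""
    upper = (1.0 - beta) / alpha
    lower = beta / (1.0 - alpha)
    return upper, lower


def ratio_path(responses, target, match=3.0, mismatch=0.5):
    """Running product of per-item ratios for one hypothesis (assumed factors)."""
    path, ratio = [], 1.0
    for r in responses:
        ratio *= match if r == target else mismatch
        path.append(ratio)
    return path


def first_decision(path, upper, lower):
    """Return (step, 'accept' or 'reject') at the first boundary crossing, else None."""
    for step, value in enumerate(path, start=1):
        if value >= upper:
            return step, "accept"
        if value <= lower:
            return step, "reject"
    return None


upper, lower = wald_thresholds(alpha=0.25, beta=0.10)
h1_path = ratio_path([1, 2, 3, 3, 2], target=1)   # first five simulated responses
print(upper, lower)
print(h1_path)
print(first_decision(h1_path, upper, lower))
```

With these bounds the running ratio for H1 stays between the two thresholds through step 5, so on this reading H1 by itself is neither accepted nor rejected in the first five steps. Choosing larger α or β narrows the band and brings the stopping point forward.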

Conclusion 

That the separate techniques fail to agree on where to stop and which 
competing hypothesis to support comes as no surprise. There are numerous 
reasons why agreement between techniques is unlikely. The initial 
statistical prerequisites are numerous, and unevenly taken into account. 
Unconditional priors do not have the same effect as simple conditionals or 
compound conditionals. The inclusion of each new term predictably affects 
computations, such that in general, with all else held the same, the larger 
the number of priors and conditionals, the longer it will take to reach the 
stopping rule. Further complications are added if members of {H} or {x} are 
not independent, are not unambiguous or not properly targeted to the test, 
and so forth. 

Fischhoff and Beyth-Marom (1983) offer an extensive list of pitfalls 
of hypothesis evaluation: 

- untestable hypotheses (absent, nonevaluatable, too complex, 
nonexclusive) 

- wrong component probabilities (misrepresented, miscalibrated, 
nonconforming) 

- wrong prior probabilities (incomplete, fallacious, unrepresentative) 

- wrong likelihood ratios (distorted, neglected, non-causal) 

- incorrect aggregation (rules misapplied, values computed 
extraneously) 

- inadequate search of evidence (questions non-diagnostic, 
inefficient, incomplete) 

- uncertain consequences (inadequate opportunity or resources to 
pursue optimal course of action) 

In particular, a problem that confronts a diagnostician after assessing the 
available evidence from a test is how to convert such knowledge into 
concrete actions. 

"...Knowledge of the possible actions is essential in 
determining what information to gather. Two... judges 
who contemplated different actions, or evaluated their 
consequences differently, might justifiably formulate 
different hypotheses and collect different data even 
though they agreed on the interpretation of all 
possible data" (Fischhoff & Beyth-Marom, 1983, p. 250). 

None of the techniques portrayed here succeeds in addressing all of these 
concerns. 

Sequence considerations, which contribute to the nonindependence of 
{x}, are taken into account only by the more complicated methods. None 
explicitly treats the complex relationship between an examinee's ability, 
likelihood of guessing, and performance. None explicitly indicates 
which next item is optimal; that is, optimality of branching continues to 
depend on how close the members of {H} are to one another, how rapidly the 
examiner would like to converge on a single H, and how exhaustive a search 
of {H} combinations is desired. A very fast sequence can be derived if one 
steps through a selection of {H} for which all but one are known to be 
exceedingly unlikely for the examinee. The same is true if one chooses 
liberal values for d, or shapes the stopping rule to favor an 
otherwise inconclusive outcome. 
Bayesian analysis is only one of several systems which treat 
probabilistic data. However, it has been the overwhelming system of choice 
despite repeated objection, if only because completely explicated 
alternatives are rare. A system which allowed incompleteness, 
nonmultiplicative joint probabilities, and conditional nonindependence of 
H would be preferable in the context of diagnostic testing (see footnote 1). 
The Baconian system of Cohen appears to meet these needs. For example, 
Cohen's system does not include mathematical additivity, an inherent 
property of Bayesian techniques, so P(Hi) = 0 does not mean that P(Hj) = 1. 
Conjunction of probabilities, which is multiplicative in Bayesian analysis, 
is handled by taking the minimum: P(H) = P(H1 ∩ H2 ∩ ...) = min(P(Hk)). 
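
The contrast between the two conjunction rules can be made concrete with a small sketch. The function names and the support values are hypothetical, and the product rule is applied here as if the component hypotheses were independent, which is an assumption made only for illustration.

```python
import math


def bayesian_conjunction(probs):
    """Multiplicative rule: joint probability of components treated as independent."""
    return math.prod(probs)


def baconian_conjunction(probs):
    """Minimum rule described in the text for Cohen's Baconian system."""
    return min(probs)


# Seven components, each individually well supported (hypothetical values).
support = [0.9] * 7
print(bayesian_conjunction(support) < 0.5)   # True
print(baconian_conjunction(support))         # 0.9
```

Under the multiplicative rule a chain of seven components, each supported at .9, already falls below .5 as a whole, whereas under the minimum rule the conjunction is only as weak as its weakest component.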

Remaining for further study is how the rules of Baconian probability 
manipulations might apply to the Bayesian techniques presented here. A 
closely related issue is whether the Baconian system is as sensitive to 
the choice of prior probabilities as the exact Bayesian systems which are 
shown above. 

A further set of issues about statistics for diagnostic testing 
concerns a facet of test design mentioned only fleetingly in this paper: the 
relations of items to ability θ. Indeed, only if the hypotheses are 
well-bounded and the choices for test items are demonstrably associated 
with those hypotheses will a diagnostic inference system succeed. That is, 
if the hypotheses available for assessment are unproductive (ill-suited, 
poorly framed, highly redundant, or otherwise off target), no amount of 
statistical manipulation will rescue the examiner from a possibly erroneous 
and certainly frustrating conclusion. Likewise, if the choices available 
to an examinee are poor reflections of good hypotheses, the examiner will 
also experience no closure at all, or potential diagnostic inaccuracies if 
closure is reached. 

1 Incompleteness: P(E|F+) = 0 and P(E|F-) = 0 are allowed; 
in Bayesian analysis if P(E|F+) = 1, P(E|F-) must equal 0. 

Nonmultiplicative joint probabilities: the joint occurrence 
of two relatively rare events need not be less than 
their separate occurrence; in Bayesian analysis, soon enough 
the multiplicative rule leaves any hypothesis p(H) supported 
less than p = .5. 

Conditional nonindependence of {H}: hypotheses may be evaluated 
even if they overlap, or incompletely requested at each 
stage in a hierarchy of hypotheses. 